This paper brings to first fruition an analytic schema based on four elements.
These involve a conception of skills independent of particular testing devices;
the development and application of a class of statistical models incorporating
qualitative definitions of skill, distorted in item response by errors
conceived as misclassifications; a critique and reformation of the concept of
test validity--making more concrete and specific the implications of
invalidity; and an integration and fusion of these concepts which allows
meaningful empirical analyses of item response data. We believe that this
conception/model will contribute to the clarification of previously
intractable technical and policy issues in the testing field.

1. The Test Scene: Standards of Performance, Test Instruments, and
Educational Assessment

Historically, the purpose of educational and psychological test instruments
has been to ground decisions about individuals, mainly to sort individuals
into groups of relatively homogeneous intelligence, ability, performance,
or achievement. The use of achievement tests for program or system evaluation
is relatively new. It has been strongly advocated only during the last
decade. Accordingly, tests that were originally designed to compare and
sort individuals, such as standardized achievement tests, have been and are
currently also widely used in the evaluation of educational programs and
systems. Increasingly widespread state testing programs commonly use
standardized tests for assessing pupil performance statewide and at district
levels, but often they also provide test score information to schools and
teachers about their pupils so as to ease and improve local decisions about
pupil instruction.

Standardized, norm-referenced tests, primarily designed to position pupils
relative to one another and to typical performance levels ("norm"
distributions) on an achievement continuum, are still the most common test
type in use, both for such individual assessments and for program or school
system evaluation. This type of test, almost exclusively, is also used to
predict future performance of individuals.



College entrance examinations, such as the Scholastic Aptitude Test (SAT),
general aptitude batteries, and standardized intelligence tests are among the
most common, if problematic, devices for such individual performance
predictions. These norm-referenced tests have severe shortcomings, however.
With growing concern about educational goals, their accomplishment through
specific programs, and the assessment of their attainment by individuals,
displeasure with norm-referenced tests has increased. Dissatisfactions have
arisen because these tests do not address specific, defined goals and
objectives and their mastery by individuals.

Objective-, domain-, and criterion-referenced tests, all of which focus on
specific content, objectives, goals, and achievements to be reached, have
emerged. The development of such tests was also impelled by the increasing
resources, human and material, available to teachers, allowing them to
individualize instruction with respect to content and goals, which in turn
necessitated individually tailored assessments of pupil achievement. A
third movement, born of dissatisfaction with the achievements of high
school graduates, has adjoined itself, and together they have compelled the
development of tests linked directly to educational goals. The need for
minimal performance standards for graduation and promotion has promoted new
test types.

All of these developments have initially concentrated on the assessment of
individuals, primarily within single classrooms. Objective- and
domain-referenced instruments are designed to allow concrete specification of
the goals of measurement. The only current extension of objective-referenced
testing beyond the classroom is attempted by the National Assessment of
Educational Progress and similar state testing programs. The National
Assessment measures performance of a nationwide sample of pupils in various
content areas by means of formal specification of educational objectives. But
their test reporting has severe limitations: their reports do not permit,
thus far, summarization of performance on test items into levels or patterns,
allowing potential comparisons to performance standards.

New purposes of testing require the rethinking and modification of old
procedures. And meaningful use of educational test data for nationally or
regionally representative assessments of the proportions of individuals
meeting educationally relevant standards would demand combinations of
existing concepts in new operational forms.



Criterion-referenced tests have been constructed with narrow content ranges,
because of their use for instructional decisions about individuals in
classrooms or courses. Standardized, norm-referenced tests cover broader
ranges of content, because of requisites for nationwide applicability and
their less frequent administration to individuals, at most once or twice
during a school year. Objective-referenced instruments, used in the National
Assessment and intended for extensive evaluation of American education,
encompass still wider ranges of accomplishment within content areas. This
breadth of scope is made possible by the absence of the usual requirement of
accurate measurement for every individual.

So, if we are to use the concept of a standard or performance criterion for
more general purposes than individual assessment, new varieties of testing
devices must be developed or important modifications of existing instruments
and procedures undertaken. Thus, either criterion-referenced tests and the
standards they assume need extension to broader content areas without losing
the meaning or specificity of their criterion levels, or wider ranging tests
must be equipped with such standards in order to serve new purposes.

It is possible to set performance standards and compare them to performance
on tests which were not specifically designed for that purpose. This is surely
not the most desirable state of affairs, but may be the wisest one at the
beginning when we are exploring the best ways to accomplish our new goals. The
intent of this paper is, in fact, to use existing--nationally
representative--standardized test data to estimate the proportions of
elementary school pupils in educationally meaningful performance categories.

2. Validity Reconsidered 

Most recent psychometric work on validity-related matters has focussed on the
use of tests for selection decisions. This work has been strongly stimulated
by legal concerns about the fairness of selection procedures, primarily those
used in the employment process. The focus of this research has not been on
the nature of the tests themselves or the measurements deriving from them, but
on the social selection procedures that incorporate these tests. Thus, the
implications of the work for changes in the process relate only to the ways in
which the scores of individuals with different non-test characteristics are
incorporated into the criteria for selection, not to such issues as item
content, item format, method of scoring, etc.

As a general perspective, this orientation fragments the validity concept--as
tests used in different ways--and it forecloses whole classes of questions
that relate to item and test format, content selection, scoring and scaling.
From our perspective, the new work does not focus on test validity at all. It
primarily is a conceptual framework and a set of standards for assessing the
social worth of selection procedures incorporating any criteria that are
(a) quantitative, and (b) measured with error. Problematically, it focusses
primary attention on external criteria and allows those who should be forced
to attend to important concerns about the validity of their devices to ignore
them.

Almost all other psychometric research, until recently, has been focussed on
issues of error and reliability rather than on bias and validity. The
theoretical framework for the analysis of measurement errors has become
conceptually sophisticated, elaborate and full of concrete detail. It has
progressed to the point that primitive correlational indices are no longer
scientifically respectable as having clear meaning and where the conceptual
and analytic frameworks for test items and responses to them are fully
integrated with those for test scores.

On the other hand, the conceptual orientations to validity of tests are
diffuse, fragmented and fundamentally incomplete. The widely accepted rubric
of "construct validity" (Cronbach & Meehl, 1955) is abstractive enough so that
it gives little or no guidance in the choice of operational procedures or the
allocation of investigative resources. The decision-theoretic analysis of
selection decisions (Cronbach & Gleser, 1957) is not integrated in any
fundamental fashion with the construct framework. The recent theoretical work
on selection bias builds on the decision frame but again ignores the
"construct" issue. In fact, the whole issue of test "bias"--at its heart a
phenomenon of differential validity--has never been linked to the core
theoretical concepts of validity.
Finally, in this area, the frameworks for item assessment have never been
fundamentally integrated with those for tests. Thus, "item bias" has no
bearing on "test bias," and "content validity," which, at the operational
level, seems to mean the sampling or selection processes for the items which
make up the test, has no relation to test validity, which, at the operational
level, seems to mean a relation to a single external criterion in the
(implicit or explicit) context of a selection decision. The fact that these
non-overlapping processes can be tenuously linked via the vagaries of
"construct validity" does not imply that they could actually be integrated.

Inherently, the notion of test validity must rest on two conceptions: (a) that
which a test ought to measure, and (b) that which a test does measure. It is
the discrepancies between the two, somehow defined, that bear on validity.
Central theoretical and practical problems for psychometrics are (1) the mode
of specification of the ought and (2) the form of expression of the
discrepancy. Recent discussions of the validity concept in the psychometric
literature (Cronbach, 1971; 1980) have focussed on the word interpretation as
the entity which is validated. However, the central interpretation of
"interpretation" has, at least since Cronbach and Meehl (1955), centered on
the idea of a definition or theoretical conception of what is intended to be
measured (i.e., the "construct")--our ought. The problem with the
specification of the ought is that, if it occurs at all in the actual world of
test construction--beyond an undefined label--it is formulated in ways that
make it difficult to separate valid from invalid components of the
measurements.




Cronbach (1971) gives a salient example of a specification of an intent of
measurement which highlights this issue of separation:

     Consider further reading comprehension as a trait construct. Suppose
     that the test presents paragraphs each followed by multiple-choice
     questions. The paragraphs obviously call for reading and presumably
     contain the information needed to answer the questions. Can a question
     about what the test measures arise? It can, if any counterinterpretation
     may reasonably be advanced. Here are a few counterhypotheses
     (Vernon, 1962):

     1. The test is given with a time limit. Speed of reading may contribute
     appreciably to the score. The publisher claims that the time limit is
     generous. But is it?

     2. These paragraphs seem abstract and dull. Perhaps able readers who
     have little motivation for academic work make little effort and
     therefore earn low scores.

     3. The questions seem to call only for recall of facts presented in
     simple sentences. One wants to measure ability to comprehend at a higher
     level than word recognition and recall.

     4. Uncommon words appear in the paragraphs. Is the score more a measure
     of vocabulary than of reading comprehension?

     5. Do the students who earn good scores really demonstrate superior
     reading or only a superior test-taking strategy? Perhaps the way to earn
     a good score is to read the questions first and look up the answers in
     the paragraph.

     6. Perhaps this is a test of information in which a well-informed
     student can give good responses without reading the paragraphs at all.

     These miscellaneous challenges express fragments of a definition or
     theoretical conception of reading comprehension that, if stated
     explicitly, might begin: "The student considered superior in reading
     comprehension is one who, if acquainted with the words in a paragraph,
     will be able to derive from the paragraph the same conclusions that
     other educated readers, previously uninformed on the subject of the
     paragraph, derive." Just this one sentence separates superior
     vocabulary, reading speed, information, and other counterhypotheses from
     the construct, reading comprehension. The construct is not identified
     with the whole complex practical task of reading, where information and
     vocabulary surely contribute to success. A distinctive, separate skill
     is hypothesized. (pp. 463-464)

Cronbach's example implies several things in this context. First, it makes
clear that reading comprehension as an intent of measurement is not all
things to all persons; it is not speed, vocabulary, test-wiseness, or prior
information, regardless of whether these "constructs" contribute to success
on the test task itself, other tasks given contemporaneously, or future
tasks.* If we take this [illegible] and realize that such sources of
invalidity in the assessment of reading comprehension are (a) themselves valid
intents of measurement with other instruments and are (b) irremovable sources
of variation in test performance for many "constructs," then two further
implications flow:

     --the problem of test validation, whether focussed on the notion of
     "interpretation" or not, cannot be shifted entirely to an analysis of
     test use, and that

     --the labeling of the test or the description of what it is intended to
     measure must be sufficiently precise to allow the separation of
     components of invalidity from valid variations in performance.

Also, we must note that these sources of invalidity are often positively
related to the characteristic that is the intent of measurement. Thus, in
the Cronbach example, those who have the skills necessary for "comprehension"
of passage content or derivation of correct conclusions, given adequate
vocabulary, will also be more likely to have previously acquired that
vocabulary knowledge.

* E.g., vocabulary knowledge is a logical prerequisite for appropriate
performance on comprehension test tasks. Although variation in performance due
to differences in vocabulary can be suppressed by experimental training or
selection of common words, it cannot be removed as a source of extraneous
(invalid) variation in practical test situations.

Our ongoing program of research, of which this study is a part, is
fundamentally affected by these issues. For example, a "reading
comprehension" test might produce scores which strongly correlate with
vocabulary knowledge for several distinguishable reasons:

     --those individuals who have good comprehension skills also generally
     have extensive vocabulary knowledge and vice versa, i.e., reading
     comprehension skill(s) is (are) highly correlated with vocabulary
     knowledge and

          (a) the test primarily measures reading comprehension or

          (b) the test primarily measures vocabulary knowledge

     --those individuals who have good comprehension skills do not
     necessarily have extensive vocabulary knowledge, i.e., reading
     comprehension and vocabulary knowledge are not highly correlated and

          (c) the test primarily measures vocabulary knowledge.

If someone were to use the test for a predictive purpose where, at least
on the surface, the test label was not considered of basic importance,
that person might be unconcerned about which of these were actually the case.
However, if one were engaged in placement of individuals in remediation
programs in reading, one might not be concerned about (a) or (b) but (c)
would be troublesome. And if one were evaluating a curriculum which might
change the relation between reading comprehension and vocabulary knowledge
or engaging in a national social assessment of reading comprehension
abilities, then only (a) would constitute a satisfactory state of affairs.

As this study is focussed on the latter issue--social assessment of
competencies in reading comprehension for a national population--these
validity issues are critical. In order to generate valid estimates of the
proportion of individuals nationally possessing specific levels of reading
skill, we must be able to remove variations and biases deriving from other,
distinct, characteristics--whether they be vocabulary knowledge or
test-wiseness.

3. Valid and Meaningful National Estimates of Reading Comprehension Skill

At an earlier stage of this project, Haertel (1980) conducted a study of
standardized reading comprehension tests using large national samples of
response data. Three of those samples are here analyzed along with three
additional ones. In the earlier study Haertel attempted to differentiate
among a set of distinctively defined skills based on a linguistic analysis of
the reading comprehension test tasks. These skills were defined so that each
test item required a specific subset of the skills. An item response model
was formulated so that individuals were assumed to belong to either a group
possessing all of the skills in the subset--the "can solve" group--or a group
not possessing all of those skills--the "cannot solve" group. Individuals in
these two categories were not assumed to respond uniformly with correct and
incorrect answers, respectively. Instead, non-matching responses were allowed
to occur with specific probabilities--so-called false negatives and false
positives. Statistical analyses of the response data using the model then
yielded estimates of two distinctive types of quantities:

     a) proportions or numbers of individuals with various combinations of
     skills ("latent state probabilities") and

     b) proportions of mismatching responses deriving from each item
     ("misclassification probabilities").

The major findings of the research were that

     a) the models fit the data extremely well--extensive exploration of
     potential lack of fit resulted in no evidence of systematic deviations
     and the analyses showed that the models were at least as adequate as
     previous psychometric models with more parameters.

     b) the reading comprehension tests analyzed were not sensitive enough to
     allow differentiation of subskills--i.e., the models fit the data well
     with only one generic skill specified for each test level. Thus, a
     single common dichotomy (can and cannot solve) was sufficient to account
     for differences in the reading comprehension skills assessed by all
     items in a test at a specific level.

These results led us to two conclusions:

     1. Standardized reading comprehension tests may not have the
     discriminating power attributed to them by those who focus primarily on
     available reliability coefficients. I.e., if, as discussed above, such
     tests can only grossly discriminate between two gross skill categories,
     then there must be large elements of the reliable variance in such tests
     which are invalid, and

     2. If such invalid components are actually "stripped off" by the models
     used, then perhaps analyses could be conducted which would yield valid
     and meaningful national estimates of reading comprehension skills,
     defined at least in the broad terms corresponding to the test levels.

In the study reported in this paper, we implement the methodology and the
conceptual framework applied by Haertel (1980) using six nationally
representative samples of elementary school pupils--one for each of grades
one through six. For these samples we estimate

     a) the proportions of individuals in each grade at particular skill
     levels, and

     b) the proportions of matching and mismatching responses for each item
     at each grade level.

Because the tests used at each grade level were repeated at adjacent levels,
we are then able to trace changes in the proportions of skilled individuals
over grades and observe systematic modifications in validity of the observed
responses.

4. Study Design: Model, Data, Analysis

4.1. The Model: Latent States, Latent Responses, and Misclassifications

If students' responses to items reflected only the skills they possessed and
the skills the items required, it would be possible to establish just which
patterns of responses to a set of items should occur, and which should not.
For any combination of skills possessed, items requiring these skills (or
some of them) and no others would be answered correctly, and items requiring
skills not possessed would be answered incorrectly. Only a small number of
the possible response patterns would be expected to occur. For example,
for five items involving only three skills, there are 32 possible patterns
of correct and incorrect responses, but only 8 possible patterns of presence
and absence of skills. Thus, if each combination of skills possessed
determined a specific pattern of correct and incorrect item responses, at
most 8 of the 32 possible response patterns would be expected to occur. If
hypothesized skill hierarchies ruled out some of the 8 skill combinations,
even fewer than 8 item response patterns would be expected. Of course, an
item's predicted skill requirements do not completely determine which students
will get it right. Each item also entails unique processes, not represented
by its skill requirements. Moreover, carelessness, lapses of attention, errors
in recording a response, etc. may lead to incorrect responses by students
who possess all the skills an item requires, while successful guessing or
elimination of distractors may lead to correct responses by students lacking
one or more requisite skills. In summary, students' responses to test
items are imperfect indicators of the skills they possess and the skills
items require. Students possessing the requisite skills for an item may
give incorrect, "false negative" responses, while students lacking one or
more of the skills an item requires will sometimes give correct, "false
positive" responses.
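
The counting argument above can be checked with a short sketch. The item skill
requirements below are hypothetical, chosen only to illustrate five items
drawing on three skills; they are not the requirements used in this study.

```python
from itertools import product

def latent_patterns(item_requirements, n_skills):
    """Map every skill combination to the error-free response pattern it implies."""
    patterns = {}
    for skills in product([0, 1], repeat=n_skills):
        possessed = {k for k, has in enumerate(skills) if has}
        # An item is answered correctly only if every required skill is possessed.
        patterns[skills] = tuple(int(req <= possessed) for req in item_requirements)
    return patterns

# Hypothetical requirements: item i needs the listed skills (numbered 0, 1, 2).
ITEM_REQUIREMENTS = [{0}, {1}, {2}, {0, 1}, {1, 2}]

patterns = latent_patterns(ITEM_REQUIREMENTS, n_skills=3)
n_possible = 2 ** len(ITEM_REQUIREMENTS)   # all correct/incorrect patterns
n_expected = len(set(patterns.values()))   # patterns that can occur without error
print(n_possible, n_expected)
```

Under these assumed requirements every skill combination implies a different
pattern, so the number of expected patterns equals the number of skill
combinations. A skill hierarchy that forbids some combinations would shrink
the set further.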

The method used in this study for the validation of skills and their
relationships explicitly accounts for these imperfections. The actual
responses students mark on their answer sheets are termed "manifest
responses," and are distinguished from a hypothetical set of "latent
responses" reflecting only the skills items require and students possess. The
pattern of latent responses shows which items would be answered correctly if
false positives and false negatives never occurred. There is a set of latent
responses for each permissible skill combination. Thus, all students
possessing a given combination of skills have the same latent responses. They
are said to conform to the same latent state. For any set of items under
examination, the possible latent states and the latent response pattern for
each state are derived prior to the computer analysis, solely on the basis of
hypothesized hierarchies among skills and the different items' skill
requirements. Often, for students conforming to a given latent state (i.e.,
possessing a given combination of skills) the most likely manifest response
pattern is the same as the latent response pattern for that state. Manifest
response patterns differing for only one item from the latent response
pattern are usually less likely, manifest patterns differing for two items
are still less likely, etc. Each discrepancy between latent and manifest
response patterns, either a false positive or a false negative, is termed a
misclassification. Full details on this class of models are given in
Appendix A.
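
A minimal sketch of how misclassifications link a latent response pattern to
a manifest one. It assumes that misclassifications occur independently across
items given the latent state, which is a common simplification and may differ
from the specification in Appendix A. The function name and the probabilities
are illustrative, not estimates from this study.

```python
def manifest_probability(manifest, latent, false_neg, false_pos):
    """P(manifest pattern | latent pattern), assuming independent items."""
    prob = 1.0
    for x, z, fn, fp in zip(manifest, latent, false_neg, false_pos):
        # Probability of a correct manifest response given the latent response.
        p_correct = (1.0 - fn) if z == 1 else fp
        prob *= p_correct if x == 1 else (1.0 - p_correct)
    return prob

# Hypothetical per-item misclassification probabilities for three items.
FALSE_NEG = [0.10, 0.10, 0.10]
FALSE_POS = [0.20, 0.20, 0.20]
LATENT = (1, 1, 0)

p_match = manifest_probability((1, 1, 0), LATENT, FALSE_NEG, FALSE_POS)
p_one_off = manifest_probability((0, 1, 0), LATENT, FALSE_NEG, FALSE_POS)
p_two_off = manifest_probability((0, 0, 0), LATENT, FALSE_NEG, FALSE_POS)
print(p_match, p_one_off, p_two_off)
```

With small misclassification probabilities such as these, the matching
pattern is the most probable and each additional discrepancy lowers the
probability, which is the ordering described in the text.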



The mathematical and statistical procedures used in this study--maximum
likelihood methods--yield numerical estimates of the probabilities of each
possible misclassification for each item. Since every manifest response to
an item is either a correct classification or a misclassification, the
probability of a correct classification (a manifest response matching the
latent response to an item) can be calculated as one minus the
misclassification probability.



At the same time as it generates estimates of misclassification probabilities,
the mathematical procedure produces estimates of the proportion of the
students in each latent state. These are referred to as estimates of
structural parameters. Every student is assumed to possess one of the
permissible combinations of skills, i.e., conform to one of the latent states.
Therefore, the sum of the proportions in all of the latent states equals one.
The statistical procedures used in this study to estimate the parameters and
assess the precision of the estimates are fully described in Appendix B.
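
To make the estimation step concrete, the following sketch fits a two-state
("can solve" / "cannot solve") model to tabulated response patterns. It uses
the EM algorithm as a stand-in for whatever maximum likelihood routine
Appendix B specifies, assumes independent misclassifications across items,
and runs on made-up counts. All function names and starting values are
hypothetical.

```python
import math

def pattern_likelihood(x, rates):
    """P(manifest pattern x | latent state), given P(correct) for each item."""
    prob = 1.0
    for xj, rj in zip(x, rates):
        prob *= rj if xj == 1 else (1.0 - rj)
    return prob

def log_likelihood(counts, pi, tp, fp):
    """Log-likelihood of the pattern counts under the two-state mixture."""
    total = 0.0
    for x, n in counts.items():
        mix = pi * pattern_likelihood(x, tp) + (1.0 - pi) * pattern_likelihood(x, fp)
        total += n * math.log(mix)
    return total

def em_step(counts, pi, tp, fp):
    """One EM update of the latent state proportion and the item rates."""
    n_items = len(tp)
    w_can = w_cannot = 0.0
    can_correct = [0.0] * n_items
    cannot_correct = [0.0] * n_items
    for x, n in counts.items():
        a = pi * pattern_likelihood(x, tp)
        b = (1.0 - pi) * pattern_likelihood(x, fp)
        post = a / (a + b)  # posterior probability of the "can solve" state
        w_can += n * post
        w_cannot += n * (1.0 - post)
        for j, xj in enumerate(x):
            can_correct[j] += n * post * xj
            cannot_correct[j] += n * (1.0 - post) * xj
    new_pi = w_can / (w_can + w_cannot)
    new_tp = [c / w_can for c in can_correct]        # true positive rates
    new_fp = [c / w_cannot for c in cannot_correct]  # false positive rates
    return new_pi, new_tp, new_fp

# Made-up counts of manifest response patterns for three items.
COUNTS = {(1, 1, 1): 40, (1, 1, 0): 10, (1, 0, 1): 10, (0, 1, 1): 10,
          (1, 0, 0): 5, (0, 1, 0): 5, (0, 0, 1): 5, (0, 0, 0): 15}

pi, tp, fp = 0.5, [0.8] * 3, [0.3] * 3  # arbitrary starting values
history = [log_likelihood(COUNTS, pi, tp, fp)]
for _ in range(50):
    pi, tp, fp = em_step(COUNTS, pi, tp, fp)
    history.append(log_likelihood(COUNTS, pi, tp, fp))
print(round(pi, 3), [round(t, 3) for t in tp], [round(f, 3) for f in fp])
```

The two latent state proportions are pi and 1 - pi, so they sum to one by
construction. The log-likelihood should not decrease from one EM step to the
next, which gives a simple check on the implementation.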



In reporting the results of all analyses, a "true positive rate" and a "false
positive rate" are given for each item. The true positive rate is the
probability of a "correct" manifest response, given that the latent response
is "correct." This is one minus the item's false negative misclassification
probability. The false positive rate is the probability of a "correct"
manifest response, given that the latent response is "incorrect," i.e., the
item's false positive misclassification probability.
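
The two reported rates follow directly from an item's misclassification
probabilities. The sketch below shows the conversion, along with the
proportion of correct manifest responses implied for a two-state model. The
function names and numbers are illustrative assumptions, not values from the
analyses.

```python
def reporting_rates(false_negative_prob, false_positive_prob):
    """Convert an item's two misclassification probabilities into reported rates."""
    return {
        "true_positive_rate": 1.0 - false_negative_prob,
        "false_positive_rate": false_positive_prob,
    }

def expected_proportion_correct(can_solve_proportion, rates):
    """Marginal probability of a correct manifest response in a two-state model."""
    return (can_solve_proportion * rates["true_positive_rate"]
            + (1.0 - can_solve_proportion) * rates["false_positive_rate"])

# Hypothetical item: 10% false negatives, 20% false positives, 75% "can solve".
rates = reporting_rates(false_negative_prob=0.10, false_positive_prob=0.20)
p_correct = expected_proportion_correct(0.75, rates)
print(rates, p_correct)
```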



4.2. The Data: Sample and Testing Design

This research required data from a large sample of elementary school children.
Not only are the maximum likelihood methods used based on large-sample theory,
but in addition, to obtain stable estimates of population proportions for the
many response patterns which can occur across even a few items, numerous
respondents are needed. In addition to having many respondents, it is
desirable to have a large pool of test items from which to draw. This
facilitates item modeling by providing more small sets of items which vary
systematically in their skill requirements. Finally, the data used in this
research represent well-defined populations, so that estimates of population
parameters and their standard errors can be meaningfully interpreted.

The Sustaining Effects Study, carried out by System Development Corporation,
included the collection of achievement test data from a large nationally
representative sample of pupils in grades one through six. Data were collected
in fall of 1976 using tests of vocabulary, reading comprehension, mathematics
concepts, and mathematics computation. The sampling design and procedures
employed in this extensive data collection are described in Sustaining Effects
Study Technical Report Number 1 (Hoepfner, Wellisch, and Zagorski, 1977). For
a representative subsample of the same pupils, the Participation Study, Decima
Research collected extensive, detailed information on home background and
economic status (Breglio, Hinckley, and Beal, 1978). The population and sample
definitions for this data base are given in Display 4.1.

Display 4.1. Population and Sample Design

Population:     All 20,881,979 public elementary school pupils enrolled in
                grades 1 through 6 in the 50 United States during the
                1976-77 school year (62,534 schools).

Sample Design:  2-stage, stratified random cluster sample, implemented with
                replacement schools to adjust for non-cooperation.

Strata:         10 Federal districts
                 3 LEA sizes
                 3 LEA poverty levels
                Yields: 90 strata
                -6 strata without schools
                Yields: 84 strata

Clusters:       3 schools per stratum
                Yields: 252 schools = 84 strata times 3 schools per stratum
                -10 lost without substitution
                Yields: 242 schools

Units:          18,000 pupils
                362 lost or moved
                Yields: 17,368 pupils

                Yields: 17,366 pupils on final data file

The Sustaining Effects/Participation Study provides data on the reading
comprehension scales of the Comprehensive Tests of Basic Skills (CTBS), Form
S, at levels A (grade 1) through 3 (grade 6). These tests were given, in
pairs, at each grade level. Each test level and the grades at which it was
given are exhibited in Display 4.2.



As discussed in the section on the problem of design effects (Appendix B),
the theory on which the chi-square test and asymptotic standard errors are
based requires a simple random sample from the population. Data from the
Sustaining Effects Study, however, represent a stratified cluster sample.
In this study, a universe of schools was first defined, and all schools in
the universe were divided into strata according to size, location and other
demographic characteristics. For the Sustaining Effects Study, the universe
included public schools with some of grades 1 through 6. Once strata were
defined, some schools were randomly sampled within each stratum, and the
students tested were all clustered within these selected schools. In the
Sustaining Effects Study, students were randomly sampled within schools, and
the number tested was determined by the school's size.
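
One way to express the cost of departing from simple random sampling is an
effective sample size: the size of a simple random sample that would give the
same standard error as the one obtained under the actual design. The sketch
below illustrates that idea for a single proportion. The function names and
input values are hypothetical and are not taken from the Sustaining Effects
data.

```python
import math

def srs_standard_error(p, n):
    """Standard error of a proportion p under simple random sampling of size n."""
    return math.sqrt(p * (1.0 - p) / n)

def effective_sample_size(p, design_se):
    """Size of the simple random sample whose standard error equals design_se."""
    return p * (1.0 - p) / design_se ** 2

def design_effect(actual_n, effective_n):
    """Ratio of the actual number of respondents to the effective sample size."""
    return actual_n / effective_n

# Hypothetical inputs: an estimated proportion, its standard error under the
# cluster design, and the number of pupils actually tested.
p_hat = 0.25
cluster_se = 0.02
actual_n = 1500

n_eff = effective_sample_size(p_hat, cluster_se)
deff = design_effect(actual_n, n_eff)
print(n_eff, deff)
```

Substituting the effective sample size back into the simple random sampling
formula recovers the design-based standard error, which is what makes it a
usable replacement for the nominal sample size in large-sample tests.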

A preliminary analysis was conducted to estimate the effective sample size.
The fifth grade was chosen for this analysis, and four representative items
were selected from the level 1 test. Each fifth grade student's response
pattern across these four items was tabulated, and the variance of estimated
proportions in each of the 16 response categories was computed using the ultimate
cluster estimate of the rel-variance for ratios (Hansen, Hurwitz, & Madow,
1953, pp. 316-321). To obtain the standard error of each estimate, the square
root of the rel-variance was multiplied by the estimated proportion. In a
simple random sample, the standard error of a proportion, p, is simply the
square root of the quantity p times one minus p divided by the sample size. Using
this formula, the effective sample size could be computed for the estimate
of each proportion by determining the size of the simple random sample which
would yield the same standard error as that actually obtained. To arrive at
a single estimate of the effective sample size for use in the Study, the
harmonic mean of the 16 effective sample sizes was computed, weighting each
according to the corresponding estimated proportion. Once the grade 5
effective sample size was obtained, effective sample sizes for other grades
were estimated by calculating the size of a simple random sample which would
yield the obtained standard error, given the obtained standard deviation.
Since the ratio of the actual sample size to the obtained sample size should
be relatively invariant across grade levels, effective sample sizes for the
grades could then be estimated using the fifth grade effective sample size,
the fifth grade actual sample size, and the actual sample size at the other
grade levels (Display 4.2).

Display 4.2. Test Form and Level Administered and Sample Sizes, by Grade

Test: CTBS - Form S - Reading Comprehension (including Sound Matching)

Levels and Their Characteristics

                                          No. of    No. of   Sentences/   Response options
Level  Title                              Passages  Items    Passage      per item
A      Sound Matching                        0        28                       3
B      Reading Comprehension                24        24        1.4            3
C      Reading Comprehension: Passages       6        18        7.8            4
1      Reading Comprehension                 7        45        9.1            4
2      Reading Comprehension                 7        45       11.9            4
3      Reading Comprehension                 7   [illegible]   11.4            4

Level/Grade Match and Sample Sizes

        Level Below    Level At               Sample Size
Grade   Grade Level    Grade Level    Before edit   After edit   Effective
1           A              B             3103          2598         799
2           B              C             2750          2188         884
3           C              1             2753          2395         986
4           1              2             2638          2327         919
5           1              2             2737          2520        1005
6           2              3             3385          3017        1127
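
The effective sample size calculation described above can be sketched in a few lines. This is a minimal illustration, not the authors' program: the function names are hypothetical, and the proportions and standard errors in the usage lines are made-up values, not the sixteen grade 5 response-pattern estimates.

```python
import math


def srs_standard_error(p, n):
    """Standard error of a proportion p under simple random sampling of size n."""
    return math.sqrt(p * (1.0 - p) / n)


def effective_n(p, se):
    """Size of the simple random sample whose standard error for p equals se.

    Solves se = sqrt(p * (1 - p) / n) for n.
    """
    return p * (1.0 - p) / se ** 2


def weighted_harmonic_mean(values, weights):
    """Harmonic mean of values, each weighted by the matching entry of weights."""
    return sum(weights) / sum(w / v for v, w in zip(values, weights))


# Hypothetical inputs: two response categories with equal estimated proportions.
n_a = effective_n(0.5, 0.025)   # about 400
n_b = effective_n(0.5, 0.05)    # about 100
pooled = weighted_harmonic_mean([n_a, n_b], [0.5, 0.5])
```

The harmonic mean is pulled toward the smaller of the two values, so a single poorly estimated response category lowers the pooled effective sample size more than an arithmetic mean would.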

4.3. The Analysis: Item Selection and Model Specification

In designing the analyses, we selected a series of items from each test level.
The CTBS Form S test levels chosen for analysis were: B, C, 1, 2. Within
each level, we chose three items under the constraint that each relate to a
different reading passage.⁴ Thus, a total of 12 items were originally selected
from the four test levels. Additionally, a set of 12 items was independently
selected using the same constraints. In the following section, the
first set is referred to as the "X-items" and the second set as the "Y-items."
The full specification of these items is given in Display 4.3.

⁴ Haertel, in the earlier study (1980), found that selecting more than one item
relating to the same reading passage resulted in dependencies which distorted
the generality of the skill, defining it in a passage-dependent context,
fixing vocabulary and other passage characteristics.

The two sets were then used to produce two separate "chains" of linked
analyses. The analyses were specified by fitting a two-state model--can solve
vs. cannot solve--for each grade in which only one test level was analyzed:
Grade 1 - level B and Grade 6 - level 2. For the other grades, three-state
models were fitted. Display 4.2 specifies the level combinations analyzed
in these grades.

The skill combination states specified for these latter analyses were formalized
as possession of (a) neither of the skills corresponding to the analyzed test
levels, (b) the skill corresponding to the lower-level test, but not the upper-
level one, and (c) the skills corresponding to both test levels.



Display 4.3. Items and Passages Selected for Analysis, by Test Level

              X-items                     Y-items
Level   Item No.   Passage No.      Item No.   Passage No.
B           1          1                2          1
            4          4                6      [illegible]
           15         15                7      [illegible]
C           1          1                2          1
            6          2               15          3
           18          4               16          4
1           5          1               11          2
           16          3               18          3
           29          5               28          5
2           6          2               13          3
           14          3               21          4
           33          6               26      [illegible]

The implications of these state definitions for misclassification proportions
are:

State (a)--All correct responses are false positives and
all incorrect responses are true negatives.

State (b)--All correct responses to items on the lower
form are true positives while all incorrect responses to
these items are false negatives. All correct responses
to items on the higher form are false positives while all
incorrect responses to these items are true negatives.

State (c)--All correct responses are true positives
and all incorrect responses are false negatives.

The complete chain of X-item analyses over grade levels was then replicated
with the Y-items, producing two alternate sets of estimates of the "latent
state" parameters (Display 4.4). Finally, a simple scaling model was used
to extend the estimates of the proportions of individuals at each skill level
over all grade levels.
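
The three state definitions of this section, and what each implies for a correct or incorrect response, can be written out as a small lookup. This is only an illustrative sketch: the names `HAS_SKILL` and `classify_response`, and the labels "lower" and "upper" for the two test levels analyzed in a grade, are assumptions made here and do not come from the study.

```python
# Skill possession in each latent state of a three-state analysis.
# "lower" and "upper" denote the two test levels analyzed in a given grade.
HAS_SKILL = {
    "a": {"lower": False, "upper": False},  # neither skill
    "b": {"lower": True, "upper": False},   # lower-level skill only
    "c": {"lower": True, "upper": True},    # both skills
}


def classify_response(state, item_level, correct):
    """Label a response as a true/false positive/negative.

    state      -- latent state of the individual: "a", "b" or "c"
    item_level -- test level of the item: "lower" or "upper"
    correct    -- whether the manifest response was correct
    """
    skilled = HAS_SKILL[state][item_level]
    if correct:
        return "true positive" if skilled else "false positive"
    return "false negative" if skilled else "true negative"


# State (b): a correct answer to an upper-level item is a false positive.
label = classify_response("b", "upper", True)
```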



Display 4.4. Skill-Level Proportions Directly Estimable, by Grade

                        Skill-Level Proportion
Grade    less than B    less than C    less than 1    less than 2
5. Grade Progression in Reading Skill: Tentative Assessments

This section displays and discusses the results of the statistical analyses
outlined in Section 4.3. It is organized into four subsections which focus,
in turn, on the fit of the models, the estimates of misclassification rates,
the changes over grades in proportions of individuals possessing various levels
of reading skill, the precision of the grade-change estimates, and a prelim-
inary extension of those estimates.

5.1. The Models: How Well Do They Fit?

The empirical study which preceded this one (Haertel, 1980) strongly supported
the conclusion that standardized tests of reading comprehension--at least those
intended for elementary school pupils--can only grossly differentiate the skill
levels of such pupils. In fact, using the class of statistical models that
are fitted here, the earlier study found that distinctions beyond the dicho-
tomy "can solve-cannot solve" were not attainable within a specific test level.
Thus, in investigating grade-level progressions in skill, the first issue to
resolve was whether distinct test levels required distinct skills, or--more
accurately--whether the skill differences manifested between the test levels
were detectable with the sample sizes and methods used in this study.

Display 5.1 organizes and exhibits the evidence bearing on this issue. The
two-state models (can solve-cannot solve) used earlier were fitted to the data from
pupils at grade levels 2, 3, 4, and 5. The three-state models described in
Section 4.3 were also fitted to these data. The left hand columns of the
display exhibit the grade levels, test-level combinations, and item sets for
which models were fitted. The remainder of the table contains the likelihood
ratio chi-square values for the models, together with the difference between
them.* The latter statistic yields an assessment of the value of the third
state in explaining the responses of the individuals. Thus it informs us about
whether skill level differences are manifested, in a detectable form, between
the test levels.

Display 5.1. Comparisons of Two and Three Latent State Models

                             Two-State Model    Three-State Model       Difference
Grade  Levels  Item Set    Chi-square   d.f.    Chi-square   d.f.    Chi-square  d.f.    p
2      B, C    X              93.08      50        34.16      49        58.93     1    <.001
               Y              80.32      50        39.26      49        41.06     1    <.001
3      C, 1    X              77.88      50        54.03      49        23.85     1    <.001
               Y              68.24      50        36.07      49        32.17     1    <.001
4      1, 2    X              64.15      50        58.23      49         5.92     1     .015
               Y              54.69      50        48.92      49         5.77     1     .016
5      1, 2    X              44.57      50        42.02      49         2.55     1     n.s.
               Y              39.77      50        35.45      49         4.32     1     .05
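
The difference statistic reported in Display 5.1 can be checked by hand. The sketch below is an illustration rather than the original computation: the function names are hypothetical, and it relies on the standard identity that a chi-square variate with one degree of freedom is the square of a standard normal variate, so its upper-tail probability is `erfc(sqrt(x / 2))`.

```python
import math


def chi2_upper_tail_1df(x):
    """Upper-tail probability of a chi-square variate with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def difference_test(chi2_two_state, chi2_three_state):
    """Difference in likelihood ratio chi-squares and its one-d.f. p-value."""
    diff = chi2_two_state - chi2_three_state
    return diff, chi2_upper_tail_1df(diff)


# Grade 4, item set X, using the chi-square values listed in Display 5.1.
diff_g4, p_g4 = difference_test(64.15, 58.23)   # difference of 5.92

# Grade 5, item set X: the difference of 2.55 is not significant at .05.
diff_g5, p_g5 = difference_test(44.57, 42.02)
```

Only the one-degree-of-freedom case is covered here; a general chi-square tail probability would need an incomplete gamma function, which the Python standard library does not provide.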



The evidence clearly supports test level differences, especially among the
earlier ones. And none of the three-state models display more than chance
levels of lack of fit. The two-state models clearly do not fit well for the
early grade levels, with the fit improving in higher grades. Thus, levels
B and C manifest clearly distinctive skills. This is to be expected, as
these test levels are quite different (Display 4.2). Level B contains only
one item per passage and each passage averages only 1.4 sentences in length.
On the other hand, Level C was constructed with three items per passage and
the passage lengths average almost eight sentences. Clearly different
skill levels are required and they are amply manifested in the data.

* The difference chi-square is merely the difference in chi-square values
resulting from the two estimation procedures. Under the hypothesis that the
two-state model is correct, it is distributed as (central) chi-square with
one degree of freedom.

Differences, likely smaller but still clear, are exhibited between Level C
and Level 1 as well. Such differences, however, become difficult to detect
when we compare Levels 1 and 2. For the fourth grade group, there is some
evidence, but it is considerably weaker than at lower test levels, and no evi-
dence of such distinctiveness is apparent at fifth grade. In what follows,
we will maintain the Level 1-Level 2 distinction, but the proportion of in-
dividuals estimated to be in the Level 1--intermediate--state is uniformly
small.

All in all, there are no obvious differences between the two item sets (X and Y)
in the evidence they provide, and the three-state models all fit the data well.

5.2. Response Validity: Matches and Mismatches between Manifest Response and Skill Level

Rates of valid correct responses. Display 5.2 (A and C) exhibits estimates of
the rates at which individuals in the various grades respond correctly to each
of the items⁵ when they actually possess the reading skill appropriate to the
particular test level. Of considerable importance is the fact that all these
values (except two) are estimated to be less than one. And the vast majority
of these values are precisely enough estimated to be clearly distant from
one in fact. This implies that there is an appreciable probability that an
individual possessing the relevant skill will manifest an incorrect response.
Note, however, that all values but one exceed 0.5, which is surely a baseline
of minimal validity, and also that of the thirty-six potential differences in
parameter values across adjacent grade levels,⁶ thirty-two display increases.
This implies that, for particular items, factors which cause skilled individuals
to respond incorrectly diminish in their impact over grades.

Rates of invalid correct responses. Display 5.2 (B and D) also exhibits esti-
mates of the rates at which individuals in the various grades respond correctly
to each of the items when they actually do not possess the reading skill appro-
priate to the test level. In more simplified models of the response process,
these rates are termed "guessing" probabilities and are sometimes "corrected"
via "formula" scoring. Note that forty-eight of the sixty estimates exceed
the nominal (equi-probable) "guessing" values.⁷ Note also that of the thirty-six
potential differences in parameter values across adjacent grade levels, thirty-
four display increases. This implies that, for particular items, factors which
cause unskilled individuals to respond correctly increase in their impact
over grades.

⁵ Note that there are 30 item/grade-level combinations for each item set.

⁶ Note that items from Levels 1 and 2 were repeated in three grades while those
from Levels B and C were only repeated in two grades.

⁷ "Guessing" probabilities are usually estimated by the reciprocal of the number
of response options. Thus, the nominal values are 1/3 for Level B and 1/4 for
Levels C, 1, and 2 (Display 4.2).

Display 5.2.A. Misclassification Parameter Estimates, by Item and Grade

True Positive - Set X

                                  Grade
Level  Item      1       2       3       4       5       6
B        1     1.000   0.999
         4     0.757   0.998
        15     0.734   0.983
C        1             0.960   0.994
         6             0.922   0.967
        18             0.743   0.945
1        5                     0.887   0.945   0.925
        16                     0.865   0.9?4   0.962
        29                     0.849   0.947   0.975
2        6                             0.9??   0.963   0.976
        14                             0.672   0.719   0.858
        33                             0.571   0.633   0.758

Display 5.2.B. Misclassification Parameter Estimates, by Item and Grade

False Positive - Set X

                                  Grade
Level  Item      1       2       3       4       5       6
B        1     0.280   0.654
         4     0.507   0.678
        15     0.360   0.442
C        1             0.414   0.601
         6             0.240   0.441
        18             0.244   0.277
1        5                     0.391   0.580   0.687
        16                     0.274   0.457   0.484
        29                     0.222   0.379   0.401
2        6                             0.?93   0.484   0.599
        14                             0.214   0.259   0.230
        33                             0.318   0.328   0.341

Display 5.2.C. Misclassification Parameter Estimates, by Item and Grade

True Positive - Set Y

                                  Grade
Level  Item      1       2       3       4       5       6
B        2     0.842   0.983
         6     1.000   0.999
         7     0.612   0.999
C        2             0.898   0.970
        15             0.965   0.984
        16             0.912   0.992
1       11                     0.392   0.502   0.517
        18                     0.631   0.911   0.912
        28                     0.903   0.951   0.964
2       13                             0.822   0.835   0.854
        21                             0.888   0.928   0.973
        26                             0.897   0.941   0.976

Display 5.2.D. Misclassification Parameter Estimates, by Item and Grade

False Positive - Set Y

                                  Grade
Level  Item      1       2       3       4       5       6
B        2     0.373   0.592
         6     0.344   0.635
         7     0.319   0.507
C        2             0.303   0.423
        15             0.439   0.562
        16             0.246   0.396
1       11                     0.220   0.230   0.238
        18                     0.243   0.331   0.407
        28                     0.351   0.471   0.520
2       13                             0.465   0.441   0.448
        21                             0.307   0.391   0.356
        26                             0.296   0.381   0.497
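
The comparison against nominal "guessing" values can be illustrated as follows. This is a hedged sketch: the names `RESPONSE_OPTIONS`, `nominal_guessing` and `exceeds_nominal` are hypothetical, the option counts are those listed in Display 4.2, and the three estimates used are the grade 1 false positive values for the Level B items of Set Y as read from Display 5.2.D.

```python
# Number of response options per item, by test level (Display 4.2).
RESPONSE_OPTIONS = {"B": 3, "C": 4, "1": 4, "2": 4}


def nominal_guessing(level):
    """Equi-probable guessing value: reciprocal of the number of options."""
    return 1.0 / RESPONSE_OPTIONS[level]


def exceeds_nominal(level, estimate):
    """True when a false positive estimate is above the nominal guessing value."""
    return estimate > nominal_guessing(level)


# Level B items of Set Y at grade 1, as read from Display 5.2.D.
grade1_level_b = [0.373, 0.344, 0.319]
flags = [exceeds_nominal("B", est) for est in grade1_level_b]
```

Two of these three estimates lie above the nominal value of 1/3 and one lies below it.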

Factors contributing to invalidity. It is instructive to recall here the
threats to validity of reading comprehension inferences which Cronbach (1971)
took from Vernon (1952). Of the six threats which he summarized, three would
affect the rate of true positive responses and three the rate of false positive
responses. Those falling in the first category include: speed, motivation,
and vocabulary. If an individual possessed the requisite reading comprehension
ability but a) took longer to read and respond than the time allowed, b) found
the material too foreign to his experience or interest to try hard,
or c) had insufficient vocabulary to exercise his comprehension skills, then
he might respond incorrectly. These factors, however, would have no impact on
the rate of false positive response.

On the other hand, recognition/recall, test-wiseness, or prior information would
have no impact on the rates of true positive responses. However, if a) the item
tested recall or recognition rather than comprehension, b) the individual had
the skill to eliminate inappropriate response options without comprehending the
passage, or c) he knew the answer without reading the passage, the unskilled
individual could attain a correct response at a rate above the base guessing
probability.

5.3. Grade Progressions: Direct Estimates, Precision, and Extensions

Estimates. The estimates of the proportions of individuals, within each grade,
who possess skills below each test level are given in Display 5.3. As the
test level/grade level matches were not complete, estimates are missing for
the lower test levels in the higher grades and vice versa. As the proportions
are cumulative, they increase over skill levels within a grade. These in-
creases result from the definition of the proportions and are not empirical
findings. The proportions decrease over grade levels for a particular skill
column. This is an empirical finding and signals the increase in the proportion
of those attaining particular skill levels over grades. The only exception to
this occurs between the first and second entries in the third column; these
differences are small and probably reflect the fact that the skills reflected
in test levels 1 and 2 are difficult to distinguish (see Section 5.1).

The corresponding values estimated from the two item sets (X and Y) are approxi-
mately equal and the general findings are consistent and clear. The percentage
of individuals who possess the most minimal comprehension skills (level B)
increases from about 20 percent⁸ at the beginning of grade one to about 70
percent at the beginning of grade 2. At higher skill levels (level 1 or
above), the percentage of skilled individuals increases from about 25 percent
in grade three to substantially more than 70 percent by grade six.

Display 5.3. Estimates of Latent State Probabilities, by Grade and Item Set

                    Cumulative Probability of State
Item Set  Grade      <B       <C       <1       <2
X           1      0.843
            2      0.270    0.508
            3               0.333    0.411
            4                        0.444    0.504
            5                        0.289    0.325
            6                                 0.295
Y           1      0.789
            2      0.314    0.481
            3               0.274    0.489
            4                        0.498    0.555
            5                        0.345    0.391
            6                                 0.263
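
Because the entries of Display 5.3 are cumulative, the proportion of pupils in each separate state within a grade is obtained by differencing adjacent entries. The sketch below illustrates this for one grade; the function name is hypothetical and the rounding to three decimals is a presentational assumption.

```python
def state_probabilities(cumulative):
    """Convert cumulative probabilities for one grade into state probabilities.

    cumulative -- probabilities of being below successive test levels,
                  in increasing order of level
    Returns one probability per state, the last being the probability of
    possessing the skill of the highest level analyzed.
    """
    bounds = [0.0] + list(cumulative) + [1.0]
    return [round(hi - lo, 3) for lo, hi in zip(bounds, bounds[1:])]


# Grade 2, item set X (Display 5.3): below B = 0.270, below C = 0.508.
below_b, b_only, c_or_above = state_probabilities([0.270, 0.508])
```

For this grade the three values are the proportions possessing neither skill, the Level B skill only, and the Level C skill.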

Precision. Estimates of the variances and covariances⁹ of the values given
in Display 5.3 are exhibited in Display 5.4. These estimates are organized
by item set and grade level. Because the grade-level samples are constituted
of different individuals, parameter estimates for distinct grade levels do
not covary. Thus, covariances are displayed for estimates pertaining to
common grade levels only. The "first" and "second" designations in the column
headings refer to the first and second entries in the corresponding row of
Display 5.3.

Values of the first and sixth grade variances are larger than the other values
because two-state models were fitted to data deriving from only one test level.
When three-state models are fitted to data from two appropriate test levels,
individuals are more finely differentiated and standard errors of estimates
diminish even though estimates are unbiased, in either case, under the model.
This is akin to the increases in precision accompanying an analysis of covariance.
In the case of precision estimates, differences between item set-X values and
item set-Y values are real because distinct item sets are differentially infor-
mative about the parameter values.

⁸ [(1. - .84 + 1. - .79)/2 = .19]

⁹ These sampling variances and covariances are based on the effective sample
sizes rather than the actual ones and thus are adjusted for the sampling
design's effect on precision (see Appendix B).




Display 5.4. Estimated Variances and Covariances of Latent State
Probability Estimates*

                    Estimated Dispersions (x 10^[exponent illegible])
Item Set  Grade   first variance   covariance   second variance
X           1        210.629
            2          5.083          2.297          7.125
            3          3.662          2.634          4.523
            4          7.352          4.979          8.974
            5          6.274          4.312          7.623
            6         26.276
Y           1         33.160
            2          2.083          0.971          2.227
            3          2.056          1.619          7.622
            4          3.297          1.?38          2.726
            5          3.041          1.751          2.629
            6          7.251

* These dispersions should be referred to the cumulative probabilities given
in Display 5.3. All values should be divided by 10^[exponent illegible].



Extensions. If data were available on the whole range of test levels for in-
dividuals in each grade, Display 5.3 could be extended to show how extensively
skills at each level were mastered by those in each grade. Display 5.5 exhibits
the results of an analysis which extended these values indirectly. This
analysis was performed by scaling the values in Display 5.3 according to a model
which assumed that the cumulative probabilities could be logistically transformed
so that the values resulting were an additive function of parameters represent-
ing grade and test level.¹⁰

Display 5.5. Fitted State Probabilities, by Grade and Item Set

                           State Probabilities
Item Set  Grade     <B       B        C        1       ≥2
X           1     0.843    0.094    0.023    0.007    0.033
            2     0.270    0.238    0.112    0.047    0.333
            3     0.152    0.181    0.108    0.051    0.508
            4     0.153    0.183    0.108    0.051    0.505
            5     0.082    0.117    0.083    0.043    0.675
            6     0.072    0.045    0.077    0.041    0.705
Y           1     0.789    0.094    0.067    0.010    0.040
            2     0.314    0.167    0.220    0.043    0.256
            3     0.157    0.117    0.215    0.053    0.458
            4     0.162    0.119    0.217    0.053    0.449
            5     0.092    0.078    0.171    0.050    0.609
            6     0.053    0.049    0.122    0.039    0.737

The entries in Display 5.5 should be treated with caution in tracing skill
gains as they are based on stringent assumptions about the uniformity of such
skill gains over grade levels. They do, however, provide some baseline data
for future studies of skill acquisition.

¹⁰ Formally, the cumulative probabilities were transformed via

$$\lambda_{ij} = \ln\left[\, p_{ij} / (1 - p_{ij}) \,\right]$$

and the model

$$\lambda_{ij} = \mu + \alpha_i + \beta_j, \qquad i = 1, \ldots, 6; \; j = 1, \ldots, 4$$

was assumed. The original $\lambda_{ij}$ are given in Display C.1 (Appendix C). The
baseline, grade, and test-level parameters are given in Display C.3. The
variances and covariances of the logits are given in Display C.2. Estimates
were derived by sequentially differencing the logits, beginning with grade 1
and averaging the sole pair of duplicated estimates. This "degree of freedom"
was also used to "test" the model, using values from Display C.2. Resulting
parameter estimates were used to reconstruct cumulative probabilities for all
table locations and these were differenced to produce Display 5.5.

6. Conclusions

There are two major thrusts of this study. One relates to the discussion of
test validity undertaken in Section 2. That discussion attempted to lay out
the ground for a reconception of validity based explicitly on the notions
that tests have intents and that their characteristics never completely
match those intents. The deduction from this specification was that what
tests are intended to measure ought to be defined in a fashion that is, both
verbally and formally, independent of the test instrument. Only such a
definition will allow the use of the construct validity notion in a produc-
tive fashion, differentiating invalid from valid components of measurement,
even when they are related to one another.
The methodology used and the data sets to which it was applied permitted us
to explore (Haertel, 1980) and define analyses which came to empirically
distinguish between reading comprehension and other, related, characteristics
which standardized tests of reading comprehension measure. The distinction
arrived at is surely incomplete, but the results are surely provocative
enough to stimulate considerable further work. Because of our ability to
distinguish between parameters which relate only to reading comprehension
and other parameters which directly reflect components of in-
validity, and because these latter parameters are further differentiated
with respect to the particular variety of invalidity, we were able to trace
changes in the validity of the reading comprehension test scores over grade
levels. In doing this, we observed that some components of invalidity de-
crease while others increase as the grade level, and thus the reading compre-
hension skill, increases. We believe that the conceptual framework and the
modes of analysis used here will eventually lead to a much more structural
and surely more accurate analysis of the validity of tests such as those analyzed
in this paper.
The second major challenge taken up in this study was that of estimating
grade level changes in the reading comprehension skill attainment of American
elementary school pupils. And we wished to do that in a fashion which would
separate the valid components of reading comprehension measured by the tests
from related, but invalid, components. We have done this. But how accurate
and meaningful are these estimates? First, we are constrained by the tests
in two distinct fashions:

(1) the items on these tests do not allow refined
measurement of reading subskills actually ad-
dressed by them (Haertel, 1980), and

(2) these items may also miss major components of
the reading process which are rightly called
comprehension.

We do not view the former as problematic because we wished to address the
reading comprehension process at a more general and socially meaningful level.
The latter may be more an issue in the long run but we have no simple way
of addressing it in the context of this study. A third threat to the accu-
racy and meaningfulness of the estimates relates to the above discussion
concerning the components of invalidity and the accuracy with which the sta-
tistical procedures remove their influence on the comprehension estimates.
This issue is not fully resolvable in the absence of further work but we are
encouraged by our results.

Finally, assuming that our conceptions and models--at least in outline--are
appropriately and correctly focused, what remains to be done? From our
perspective, at least three lines of work have positive value:

(1) further theoretical and empirical work which will
independently "validate" the components of in-
validity which we believe we have "trapped" in our
misclassification parameters, e.g., relating in-
dependent assessments of vocabulary knowledge to
the true positive rates and test-wiseness assess-
ments to the true negative ones;

(2) direct exploration of the implications of the models
and analysis of further data to fully articulate the
validities of existing tests and their consequences
for biases in the assessments of individuals in par-
ticular groups or with specific characteristics;

(3) application of the techniques to existing data sets
with more desirable characteristics in terms of item
selection, age levels, subpopulations, e.g., NAEP data;

(4) the creation of new tests developed to minimize con-
tamination by the components of invalidity isolated by
our techniques.

This paper brings to first fruition an analytic schema based on four elements.
These involve a conception of skills independent of particular testing devices;
the development and application of a class of statistical models incorporating
qualitative definitions of skill, distorted in item response by errors con-
ceived as misclassifications; a critique and reformation of the concept of
test validity--making more concrete and specific the implications of in-
validity; and an integration and fusion of these concepts which allows
meaningful empirical analyses of item response data. We believe that this
conception/model will contribute to the clarification of previously intrac-
table technical and policy issues in the testing field.
Models for Qualitative Data With Misclassifications

A.1. Latent States

The skills explored here are each regarded as dichotomous. It is assumed that
with respect to each skill, students all belong to one of two categories:
those who possess the skill and those who do not. Of course, a student's
membership in one or the other category is not observable, but may be inferred
from his item responses. These inferences are always subject to error. Thus,
the two categories defined by each skill are said to be latent states, and
may only be inferred from the student's manifest responses.

When more than one skill is considered at a time, each possible pattern of
present and absent skills gives rise to a distinct latent state. As an
example, consider two skills, A and B. These could give rise to four latent
states: (1) lacks A lacks B, (2) lacks A has B, (3) has A lacks B, and (4)
has A has B. Every student would fall into one of these four patterns, and
would conform to exactly one latent state. Just as two skills yield four
latent states, three skills could give rise to eight latent states, four
skills to sixteen, and so forth.

In general, some latent states may be excluded on theoretical grounds. That
is, it may be hypothesized that there are some patterns of presence and absence
of skills which will not describe any students at all. In this study, such
constraints are expressed as hypothesized hierarchical relationships among
skills. Where one skill is logically, psychologically, or chronologically
posterior to another, the former is said to be hierarchically related to the
latter. Suppose, in the two-skill example, that Skill B is logically dependent
on the presence of Skill A. Then Skill B would be said to be hierarchically
related to Skill A, and no student would be expected to belong in the latent
state "lacks A has B." Under this assumption, only three rather than four
states would be required to classify all students. In the absence of any hier-
archical constraints, four skills would give rise to sixteen latent states.
However, a strict skill hierarchy would prohibit all but five of these skill
combinations.
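
The counts quoted above (four states for two skills, three under the hierarchy, sixteen versus five for four skills) can be verified by enumeration. The sketch below is illustrative only; the function name and the representation of a hierarchy as a mapping from each skill to its prerequisite are assumptions introduced here.

```python
from itertools import product


def latent_states(skills, prerequisites=None):
    """Enumerate the admissible latent states for a list of dichotomous skills.

    skills        -- skill names, e.g. ["A", "B"]
    prerequisites -- optional mapping from a skill to the skill it depends on,
                     e.g. {"B": "A"} when B is hierarchically related to A
    Each state is a dict mapping skill name to True (has) or False (lacks).
    """
    prerequisites = prerequisites or {}
    states = []
    for pattern in product([False, True], repeat=len(skills)):
        has = dict(zip(skills, pattern))
        # A state is excluded if a skill is present without its prerequisite.
        if all(has[pre] or not has[skill] for skill, pre in prerequisites.items()):
            states.append(has)
    return states


unconstrained = latent_states(["A", "B"])            # four states
hierarchical = latent_states(["A", "B"], {"B": "A"})  # "lacks A has B" excluded
chain = latent_states(["A", "B", "C", "D"], {"B": "A", "C": "B", "D": "C"})
```

Under a strict hierarchy of n skills only n + 1 states survive, which is why the four-skill chain leaves five of the sixteen combinations.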

The distribution of skills in a population of students can be described completely
by the proportions of students in each latent state. Since every student is
in exactly one latent state, these proportions must sum to exactly one.

A.2. Misclassifications

An item's skill requirement is whatever set of skills is required to solve that
item. If a set of items with appropriate skill requirements is chosen, students'
overt responses to the set of items may be used to classify them into one of
a set of manifest states exactly corresponding to the latent states described
above. To continue the earlier example, suppose the skill requirement for item
1 consists only of Skill A. Then students can be divided into two manifest states
on the basis of their responses to item 1: "lacks A" and "has A." Suppose
item 2 has as its requirement Skill B only. Then the four possible patterns
of responses to items 1 and 2 would define four manifest states, corresponding
to the four latent states described earlier.
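
The two-item example can be made concrete with a short sketch that assigns each pair of responses to a manifest state and tabulates the observed proportions. The function names and the sample responses are hypothetical and serve only to illustrate the classification.

```python
from collections import Counter


def manifest_state(item1_correct, item2_correct):
    """Manifest state implied by responses to item 1 (Skill A) and item 2 (Skill B)."""
    a = "has A" if item1_correct else "lacks A"
    b = "has B" if item2_correct else "lacks B"
    return f"{a}, {b}"


def manifest_proportions(responses):
    """Proportion of students observed in each manifest state."""
    counts = Counter(manifest_state(r1, r2) for r1, r2 in responses)
    total = sum(counts.values())
    return {state: n / total for state, n in counts.items()}


# Hypothetical responses of four students to items 1 and 2.
sample = [(True, True), (True, False), (False, False), (True, True)]
proportions = manifest_proportions(sample)
```

These are proportions of manifest states only; a student's latent state is not observed and may differ from the manifest classification.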
Obviously, a student's manifest state and his latent state need not
correspond. This is due in part to the use of the multiple-choice question
format, which affords students the options of guessing or of finding the
correct answer by a process of elimination. Even if a free-response format
were used, however, item responses would give imperfect information as to
students' possession of underlying skills. This is because (1) every item
entails unique processing requirements not captured by its skill description;
(2) the treatment of skills as unitary entities is an imperfect approximation,
thus a student's ability or inability to employ the specific processes
required by a single item is an imperfect indicator of his ability to apply
related processes; (3) even a student capable of employing the processes
required by an item may fail to do so due to lapses in attention,
carelessness, etc.; and (4) errors in recording the response will sometimes
occur, though they should be rare.

The relation between latent states and manifest states is probabilistic. In
theory, it is completely described by the set of conditional probabilities of
each manifest state being observed, given membership in each latent state.
These conditional probabilities are presented in the form of a
misclassification matrix.* The rows of this matrix correspond to manifest
states, and the columns to latent states. The entry in the ith row and the jth
column is the conditional probability of a response in the ith manifest state,
given conformity to the jth latent state. For the two-item example described
earlier, the misclassification matrix would be as shown in Table A1. Each entry
in Table A1 represents the conditional probability of a manifest state given a
latent state. For example, the entry in the first row, first column
("P(ĀB̄|ĀB̄)") is read "probability of manifest lacks A, lacks B classification,
given latent state lacks A, lacks B." Note that the diagonal entries of the
misclassification matrix represent the probabilities of correct classification
given each latent state. All off-diagonal entries correspond to
misclassifications (errors). The entries in each column of a misclassification
matrix sum to one.

*For a systematic development of misclassification matrices, their properties,
and applications, see Sutcliffe (1965a, 1965b).

From Table A1, it would appear that, for the two-item example, three independent
conditional probabilities could be specified for each of the four columns of
the misclassification matrix. (The fourth entry in each column would be
obtained by subtraction, since each column sums to one.) In practice,
specification of the misclassification matrix is simplified substantially by the
assumption of conditional independence. This assumption is required by
virtually every statistical theory of test responses which distinguishes latent
from manifest states. It is assumed that within any group of students in the
same latent state, the (conditional) distributions of responses to different
items are all independent of one another (Lord & Novick, 1968, p. 316). That
is to say, the conditional probability of a correct response to any item,
given a student's latent state, is the same regardless of his responses to all
other items. It is a consequence of this assumption that within any column
of the misclassification matrix, i.e., conditional upon any particular latent
state, the probability of any pattern of item responses is simply the product
of the conditional probabilities of the responses to the separate items.
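The product rule implied by conditional independence can be sketched in a few lines. This is an illustration only: the function name `pattern_probability` is hypothetical, and the two probabilities used in the example (.30 and .35, the chances that a student lacking both skills is nevertheless classified as having A on item 1 and B on item 2) are borrowed from the illustrative values of Table A5.

```python
def pattern_probability(pattern, p_correct):
    """P(response pattern | latent state) under conditional independence.

    pattern   -- tuple of 0/1 item scores (1 = correct)
    p_correct -- tuple of P(correct on item i | this latent state)
    """
    prob = 1.0
    for score, p in zip(pattern, p_correct):
        # Each item contributes its own factor, independent of the others.
        prob *= p if score == 1 else 1.0 - p
    return prob


# Latent state "lacks A, lacks B", with illustrative values from Table A5.
p_given_state = (0.30, 0.35)
patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
column = [pattern_probability(p, p_given_state) for p in patterns]
print(column)       # one column of the misclassification matrix
print(sum(column))  # the column sums to one, up to rounding
```

Because each column is built from one free probability per item, two values rather than three determine a column in the two-item example.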

TABLE A1.--Misclassification Matrix for the Two-Item Example

                                                Latent State
                           Lacks A,         Lacks A,         Has A,           Has A,
Manifest State             Lacks B (ĀB̄)     Has B (ĀB)       Lacks B (AB̄)     Has B (AB)

Lacks A, Lacks B (ĀB̄)      P(ĀB̄|ĀB̄)         P(ĀB̄|ĀB)         P(ĀB̄|AB̄)         P(ĀB̄|AB)
Lacks A, Has B (ĀB)        P(ĀB|ĀB̄)         P(ĀB|ĀB)         P(ĀB|AB̄)         P(ĀB|AB)
Has A, Lacks B (AB̄)        P(AB̄|ĀB̄)         P(AB̄|ĀB)         P(AB̄|AB̄)         P(AB̄|AB)
Has A, Has B (AB)          P(AB|ĀB̄)         P(AB|ĀB)         P(AB|AB̄)         P(AB|AB)
A second (and more substantive) simplifying assumption is also invoked in
specifying misclassification matrices: Misclassification probabilities only
vary with the unions of latent states conforming to or not conforming to
the skill combination required by any given item. That is to say, for any
item the misclassification probabilities for different latent states depend
only upon whether or not all the skills that item requires are present. If
a set of latent states is defined using all the skills an item requires (and
possibly others as well), then the item's skill requirements can be used to
partition those latent states into two categories. In the first would be
latent states for which all of the skills the item required were present.
In the second would be all latent states for which one or more of the skills
the item required were absent. Within each category, misclassification
probabilities for all latent states would be the same. The presence or
absence of skills not part of the given item's skill requirement is
irrelevant, and if the entire set of skills the item requires is not present,
it does not matter which or how many of the relevant skills are lacked.
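The partition described above reduces each item to two probabilities: one for students who possess every required skill and one for everyone else. A minimal sketch follows; the function name `p_classified_has` and the representation of skills as Python sets are assumptions made for illustration, and the values .95 and .30 are the item 1 entries of Table A5.

```python
def p_classified_has(required, latent, p_true_positive, p_false_positive):
    """P(manifest "has skill" on an item | latent state).

    required -- set of skill names the item requires
    latent   -- set of skill names the student actually possesses
    Only one question matters: are all required skills present?
    """
    return p_true_positive if required <= latent else p_false_positive


# Item 1 requires Skill A only; Skill B is irrelevant to it.
item1 = {"A"}
for latent in (set(), {"B"}, {"A"}, {"A", "B"}):
    print(sorted(latent), p_classified_has(item1, latent, 0.95, 0.30))
```

The first two latent states share one value and the last two share the other, regardless of Skill B.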

Table A2 shows how the assumption of conditional independence permits
simplification of the misclassification matrix for the two-item example. Note
that each conditional probability is decomposed into a product of two factors,
one for each item. New notation is introduced in Table A2, to make explicit the
relation of manifest states to particular items. Again taking the entry in
row 1, column 1 as an example, P(Ā1|ĀB̄) · P(B̄2|ĀB̄) represents the probability
of a manifest "lacks A" classification on item 1 (i.e., marking an incorrect
alternative or omitting item 1) given latent state "lacks A, lacks B," times
the probability of a manifest "lacks B" classification on item 2 (defined as
before) given latent state "lacks A, lacks B." Within each column, the
conditional probabilities for the item 1 manifest states (Ā1 and A1) must sum
to one, as must the conditional probabilities of B̄2 and B2. Thus, in the
matrix shown in Table A2, only two parameters rather than three must be
specified for each column.

TABLE A2.--Simplification of Misclassifications Introduced by the Assumption of Conditional Independence

                                                             Latent State
                            Lacks A,                Lacks A,                Has A,                  Has A,
Manifest State              Lacks B (ĀB̄)            Has B (ĀB)              Lacks B (AB̄)            Has B (AB)

Item 1 - Lacks A,
Item 2 - Lacks B (Ā1B̄2)     P(Ā1|ĀB̄) · P(B̄2|ĀB̄)     P(Ā1|ĀB) · P(B̄2|ĀB)     P(Ā1|AB̄) · P(B̄2|AB̄)     P(Ā1|AB) · P(B̄2|AB)

Item 1 - Lacks A,
Item 2 - Has B (Ā1B2)       P(Ā1|ĀB̄) · P(B2|ĀB̄)     P(Ā1|ĀB) · P(B2|ĀB)     P(Ā1|AB̄) · P(B2|AB̄)     P(Ā1|AB) · P(B2|AB)

Item 1 - Has A,
Item 2 - Lacks B (A1B̄2)     P(A1|ĀB̄) · P(B̄2|ĀB̄)     P(A1|ĀB) · P(B̄2|ĀB)     P(A1|AB̄) · P(B̄2|AB̄)     P(A1|AB) · P(B̄2|AB)

Item 1 - Has A,
Item 2 - Has B (A1B2)       P(A1|ĀB̄) · P(B2|ĀB̄)     P(A1|ĀB) · P(B2|ĀB)     P(A1|AB̄) · P(B2|AB̄)     P(A1|AB) · P(B2|AB)

The effect of invoking the second simplifying assumption, that
misclassification probabilities only vary according to the presence or absence
of an item's full complement of required skills, is shown in Table A3. Note
that for item 1 the same probabilities appear in the first and second (lacks A)
columns, and in the third and fourth (has A) columns. The presence or absence
of Skill B is irrelevant. Likewise, item 2 factors are the same for the first
and third columns, and for the second and fourth. For item 1, for example, the
second simplifying assumption implies that P(Ā1|ĀB̄) = P(Ā1|ĀB) = P(Ā1|Ā),
P(Ā1|AB̄) = P(Ā1|AB) = P(Ā1|A), P(A1|ĀB̄) = P(A1|ĀB) = P(A1|Ā), and
P(A1|AB̄) = P(A1|AB) = P(A1|A). The conditional probabilities for responses to
item 2 (B̄2 and B2) are similarly simplified. Only four values need be specified
to determine the entire matrix illustrated in Table A3. One possible set would
be P(A1|Ā), P(A1|A), P(B2|B̄), and P(B2|B).

Note that where the skill requirements of two or more items overlap, a single
response pattern may include conflicting manifest classifications. For
example, if a third item requiring only Skill B were analyzed along with items
1 and 2, one possible manifest response would be Ā1 B̄2 B3, i.e., state
"lacks A" on item 1, state "lacks B" on item 2, state "has B" on item 3. Given
any latent state s, the probability of this manifest state would be
P(Ā1|s) · P(B̄2|s) · P(B3|s), by the assumption of conditional independence.

At this point, an algebraic simplification may be introduced. The four-by-four
matrix shown in Table A3 turns out to be the Kronecker product of two
two-by-two matrices, each containing parameters for one item.* These
component matrices are shown in Table A4. Each is itself a misclassification
matrix, with columns representing latent states and rows manifest states,
the conditional probabilities in each column summing to one, and the diagonal
representing correct classifications. Note that, since the two conditional
probabilities in each column sum to one, specifying either value in a column
determines the other. Thus the entire matrix in Table A3 can be specified
given just one value from each column of the matrices in Table A4.

TABLE A3.--Simplification of Misclassifications Introduced by the Assumption of Invariance Across Irrelevant Skills

                                                         Latent State
                            Lacks A,              Lacks A,              Has A,                Has A,
Manifest State              Lacks B (ĀB̄)          Has B (ĀB)            Lacks B (AB̄)          Has B (AB)

Item 1 - Lacks A,
Item 2 - Lacks B (Ā1B̄2)     P(Ā1|Ā) · P(B̄2|B̄)     P(Ā1|Ā) · P(B̄2|B)     P(Ā1|A) · P(B̄2|B̄)     P(Ā1|A) · P(B̄2|B)

Item 1 - Lacks A,
Item 2 - Has B (Ā1B2)       P(Ā1|Ā) · P(B2|B̄)     P(Ā1|Ā) · P(B2|B)     P(Ā1|A) · P(B2|B̄)     P(Ā1|A) · P(B2|B)

Item 1 - Has A,
Item 2 - Lacks B (A1B̄2)     P(A1|Ā) · P(B̄2|B̄)     P(A1|Ā) · P(B̄2|B)     P(A1|A) · P(B̄2|B̄)     P(A1|A) · P(B̄2|B)

Item 1 - Has A,
Item 2 - Has B (A1B2)       P(A1|Ā) · P(B2|B̄)     P(A1|Ā) · P(B2|B)     P(A1|A) · P(B2|B̄)     P(A1|A) · P(B2|B)

In the general case, the misclassification matrix for any set of two or more
items is constructed by forming the Kronecker product of misclassification
matrices for the individual items. The dimensionality of these matrices
depends upon the scoring used. Since in this study skill specification has
focused on the correct response alternative only, items are scored
dichotomously (correct/incorrect) and misclassification matrices for individual
items have just two rows and two columns.
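The construction can be illustrated with the numerical values of Table A5. The sketch below implements the Kronecker product directly on lists of rows so that it needs no external library; the helper name `kron` is illustrative, and the item matrices use the convention of the text (rows are manifest states, columns are latent states, "lacks" before "has").

```python
def kron(m1, m2):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row1 for b in row2]
            for row1 in m1 for row2 in m2]


# Item misclassification matrices with the illustrative values of Table A5.
item1 = [[0.70, 0.05],   # manifest "lacks A" | latent lacks A, latent has A
         [0.30, 0.95]]   # manifest "has A"   | latent lacks A, latent has A
item2 = [[0.65, 0.15],   # manifest "lacks B" | latent lacks B, latent has B
         [0.35, 0.85]]   # manifest "has B"   | latent lacks B, latent has B

full = kron(item1, item2)
for row in full:
    print([round(x, 4) for x in row])

# Every column of the product is again a probability distribution.
print([round(sum(row[j] for row in full), 10) for j in range(4)])
```

Rows and columns of `full` come out in the order (lacks A, lacks B), (lacks A, has B), (has A, lacks B), (has A, has B), matching the four-by-four layout of Table A3. For more than two items the same function can be applied repeatedly.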

*The Kronecker product (direct product) has wide application in formal algebra,
and is used in statistics to represent a variety of factorial structures
(Bock, 1975, pp. 273-283; Haberman, 1974, pp. 150-166). Either of these
works provides a technical discussion. In the present context, to form the
Kronecker product of the two-by-two misclassification matrices for items 1
and 2 and obtain the four-by-four matrix shown in Table A3, the entire matrix
for item 2 is multiplied in turn by each element of the matrix for item 1,
and the four resulting two-by-two matrices are adjoined in the same
arrangement as the elements from the item 1 matrix.

TABLE A4.--Examples of Misclassification Matrices for Two Items

Item 1 (Requires Skill A)
                                   Latent State
Manifest State               Lacks A (Ā)     Has A (A)

Item 1 - Lacks A (Ā1)        P(Ā1|Ā)         P(Ā1|A)
Item 1 - Has A (A1)          P(A1|Ā)         P(A1|A)

Item 2 (Requires Skill B)
                                   Latent State
Manifest State               Lacks B (B̄)     Has B (B)

Item 2 - Lacks B (B̄2)        P(B̄2|B̄)         P(B̄2|B)
Item 2 - Has B (B2)          P(B2|B̄)         P(B2|B)

A.3. The Complete Model

The proportions of students in different latent states and the conditional
probabilities in the misclassification matrices together determine the
overall probability of each possible pattern of responses. To illustrate,
numerical values will be chosen arbitrarily for the parameters in the
two-item example, and the probabilities of each possible pattern of responses
will be derived. These arbitrary values are presented in Table A5. Note that the
hierarchical relationship between Skill A and Skill B has been assumed in
specifying the latent states. Note also that only six numerical values in
Table A5 were freely chosen; all others were obtained by subtraction.

Four patterns of responses to items 1 and 2 are possible. Using "+" for
correct and "0" for incorrect, these are "00," "0+," "+0," and "++." Consider
the probability of a "00" response. For a student in latent state "lacks A,
lacks B" the probability of a "00" response is .4550. For the "has A, lacks
B" state, the conditional probability of a "00" response is .0325. For the
"has A, has B" state, it is .0075. Since the proportions of students in these
three states are .40, .50 and .10 respectively, the overall probability of
a "00" response is .40 x .4550 + .50 x .0325 + .10 x .0075, or .1990.
Similarly, the probabilities of the "0+," "+0" and "++" response patterns are
.1110, .4010, and .2890, respectively.

In the same general fashion, overall probabilities of every possible pattern
of responses could be computed for any set of items, given the proportions in
each latent state and the matrix of misclassification probabilities.
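The arithmetic of this example can be reproduced as a weighted sum over latent states. The sketch below is illustrative: the variable and function names are not from the original study, and the numerical values are those of Table A5, with the proportion for "lacks A, has B" set to zero by the assumed hierarchy.

```python
# Illustrative values taken from Table A5.
PROPORTIONS = {
    ("lacks A", "lacks B"): 0.40,
    ("lacks A", "has B"): 0.00,   # excluded by the hierarchy (B requires A)
    ("has A", "lacks B"): 0.50,
    ("has A", "has B"): 0.10,
}
P_PLUS_ITEM1 = {"lacks A": 0.30, "has A": 0.95}  # P("+" on item 1 | A state)
P_PLUS_ITEM2 = {"lacks B": 0.35, "has B": 0.85}  # P("+" on item 2 | B state)


def response_pattern_probabilities():
    """Return {pattern: probability} for the patterns "00", "0+", "+0", "++"."""
    result = {}
    for r1 in "0+":
        for r2 in "0+":
            total = 0.0
            for (a_state, b_state), share in PROPORTIONS.items():
                p1 = P_PLUS_ITEM1[a_state]
                p2 = P_PLUS_ITEM2[b_state]
                f1 = p1 if r1 == "+" else 1.0 - p1
                f2 = p2 if r2 == "+" else 1.0 - p2
                # Latent proportion times the two conditionally
                # independent item factors.
                total += share * f1 * f2
            result[r1 + r2] = total
    return result


probs = response_pattern_probabilities()
for pattern, prob in probs.items():
    print(pattern, round(prob, 4))
```

Up to floating-point rounding, the four values agree with the .1990, .1110, .4010, and .2890 given in the text, and they sum to one.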



TABLE A5.--Illustrative Values for Misclassification Parameters in Two-Item Example

Latent State             Proportion

Lacks A, Lacks B           .40
Lacks A, Has B             .00a
Has A, Lacks B             .50b
Has A, Has B               .10b

Item 1 Misclassification Matrix
                            Latent State
Manifest State         Lacks A       Has A

Lacks A                  .70          .05
Has A                    .30b         .95b

Item 2 Misclassification Matrix
                            Latent State
Manifest State         Lacks B       Has B

Lacks B                  .65          .15
Has B                    .35b         .85b

TABLE A5.--Continued

                                             Latent State
                        Lacks A,       Lacks A,       Has A,         Has A,
Manifest State          Lacks B        Has B          Lacks B        Has B

Lacks A, Lacks B         .4550          .1050          .0325          .0075
Lacks A, Has B           .2450          .5950          .0175          .0425
Has A, Lacks B           .1950          .0450          .6175          .1425
Has A, Has B             .1050          .2550          .3325          .8075

Computed Probabilities of Each Possible Manifest Response Pattern

                                  Manifest State for Item 2
Manifest State for Item 1      Item 2 - Lacks B     Item 2 - Has B

Item 1 - Lacks A                    .1990               .1110
Item 1 - Has A                      .4010               .2890

aFixed by hierarchical constraint assumed for Skills A and B.

bThis value was freely chosen. All values not lettered were obtained by subtraction.



APPENDIX B

Parameter Estimation, Hypothesis Testing, and
Precision Assessment for the Models

Edward Haertel
Stanford University

B.1. The Estimation Procedure

Once a model for some set of items has been formulated, numerical values can
be estimated for the proportions of students in each latent state and for the
conditional probabilities in the classification matrices. This is
accomplished by the method of maximum likelihood. A detailed description of the
procedure used is given in Murray (1971); briefly, it is as follows. As
illustrated above (Appendix A), any set of values for the model parameters
generates a set of probabilities that students will mark each of the possible
(coded) response patterns. Since students' responses are assumed to be
(conditionally) independent of one another, the probability that students in
particular skill categories will respond in various patterns is simply the
product of their separate probabilities of so responding. Thus, using any set of
parameter estimates, we can compute the overall probability, or likelihood, of
a set of observations.

As an example, suppose that there were two items and four possible response
patterns: wrong-wrong, wrong-right, right-wrong, and right-right, where
right represents "has skill" and wrong represents "lacks skill." Suppose also
that some set of parameter estimates generated probabilities of .4, .1, .3,
and .2, respectively, for these patterns and that when ten students were
tested, the frequencies in each pattern were 5, 0, 4, and 1. Probabilities
of these students responding as they did are .4 for each of the 5
"wrong-wrong" students, .3 for each of the 4 "right-wrong" students, and .2
for the "right-right" student. The overall likelihood of the obtained data,
given the set of parameter estimates generating these probabilities, is
(.4)^5 · (.3)^4 · .2 = .000016589. As this likelihood is a function of the
parameter estimates, any set of estimates will yield a unique value for the
likelihood. The procedure in maximum likelihood estimation is to find a set of
parameter estimates which maximizes the value of this function, or, what is
usually done, minimizes the negative of the log of the function. Under
specifiable conditions, as the number of respondents increases, this strategy
will yield, with increasing probability, values which are unique and which
have statistically desirable features (Rao, 1965, pp. 289-302).
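The worked example can be checked directly. The sketch below computes the likelihood exactly as the text defines it (a product of pattern probabilities raised to the observed frequencies, with no multinomial coefficient) together with the negative log-likelihood that a minimization routine would actually work with. The function names are illustrative and are not taken from the program used in the study.

```python
import math


def likelihood(probs, counts):
    """Product of each pattern probability raised to its observed frequency."""
    value = 1.0
    for p, n in zip(probs, counts):
        value *= p ** n
    return value


def neg_log_likelihood(probs, counts):
    """Negative log of the likelihood; patterns with zero count contribute 0."""
    return -sum(n * math.log(p) for p, n in zip(probs, counts) if n > 0)


probs = [0.4, 0.1, 0.3, 0.2]   # wrong-wrong, wrong-right, right-wrong, right-right
counts = [5, 0, 4, 1]          # ten students

print(likelihood(probs, counts))          # about 1.6589e-05
print(neg_log_likelihood(probs, counts))  # about 11.0
```

Minimizing the negative log-likelihood and maximizing the likelihood select the same parameter values, since the logarithm is monotone; the log form is numerically safer because the raw likelihood shrinks rapidly as the number of students grows.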

The maximum likelihood procedure yields several useful statistics in addition
to the parameter estimates themselves. These include the likelihood ratio
chi-square and the asymptotic covariance matrix of the estimates, from which
it is possible to compute their standard errors (Rao, 1965). In large samples,
like those used in this study, these statistics can be used to assess the
overall fit of the model, and to construct confidence intervals for the value
of the parameters. The use of these statistics is further described below,
in the section on establishing criteria for goodness of fit.

Finding the maximum likelihood estimates for a given model is, in technical
terms, a linearly constrained non-linear function minimization problem. The
linear constraints are that all conditional probabilities and latent state
proportions must be between zero and one, and that certain subsets of these
parameters must sum to unity. The problem is non-linear because a given
increment or decrement in the value of a particular parameter will produce
quantitatively different changes in response pattern probabilities, depending
upon the values of that parameter and others. An algorithm for solving
problems of this kind was published by Shanno (1970a, 1970b) and was
implemented in the computer program used by Murray (1971). The same program,
with minor modifications, was used in this study. Because the number of
possible response patterns increases very rapidly with the number of items
considered, the number of items that can be simultaneously analyzed is sharply
limited. Experience with the program has indicated that models with up to four
items are completely tractable. Models with five items are roughly six times
more costly to analyze, but do not exceed the capacity of the program. Models
with six items and no more than three skills (eight latent states) can be
analyzed, but only at substantial cost, and models with more than six items
cannot be solved by the program in its present form.

B.2. Establishing Criteria for Goodness of Fit

As described by Rao (1965), the maximum likelihood estimation procedure yields,
if the model is valid, a likelihood ratio chi-square, which is asymptotically
distributed as a chi-square on k-1-p degrees of freedom, where k is the
number of possible response patterns and p is the number of non-redundant
parameters* estimated in fitting the model. The chi-squared fit statistic
assesses the likelihood of obtaining the observed data, given that the
statistical model is a correct and complete representation of the process
giving rise to the data. A large value indicates large departures of the
observed data from likely values given the model. Thus, in this application a
small chi-square is desirable. The size of the chi-square will also depend upon
the size of the sample used, since the likelihood of discrepancies of a given
size should be less if more persons are tested. That is to say, if the model
specified were correct and complete, as more and more persons were tested the
observed proportions of persons manifesting each possible response pattern
would come closer and closer to the proportions predicted by the model.

*Certain sets of parameters must sum to unity, e.g., the conditional
probabilities of a true positive and of a false negative on the same item,
or the probabilities of being in each possible latent state. Since given
all but one of the parameters in such a set the last may be obtained by
subtraction, each such set is said to contain one redundant parameter.
The choice of which parameter to regard as redundant is arbitrary. The
number of non-redundant parameters is the number which could be freely
chosen.
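The fit statistic and its degrees of freedom can be sketched as follows. The text does not spell out the formula, so the code assumes the standard textbook form of the likelihood ratio chi-square, G2 = 2 · sum of O · ln(O/E) over response patterns, where O is an observed frequency and E the frequency expected under the fitted model; the function names and the parameter count in the example are illustrative.

```python
import math


def lr_chi_square(observed, probs):
    """Likelihood ratio chi-square, 2 * sum(O * ln(O / E)), with E = n * p.

    Patterns with an observed count of zero contribute nothing to the sum.
    """
    n = sum(observed)
    return 2.0 * sum(o * math.log(o / (n * p))
                     for o, p in zip(observed, probs) if o > 0)


def degrees_of_freedom(n_patterns, n_free_params):
    """k - 1 - p, with k response patterns and p non-redundant parameters."""
    return n_patterns - 1 - n_free_params


# Frequencies and fitted probabilities from the ten-student example in B.1.
g2 = lr_chi_square([5, 0, 4, 1], [0.4, 0.1, 0.3, 0.2])
print(round(g2, 4))

# Hypothetical model: four dichotomous items (16 patterns), 9 free parameters.
print(degrees_of_freedom(16, 9))
```

When observed and expected frequencies coincide the statistic is zero, and for a fixed pattern of proportional discrepancies it grows in proportion to the number of students tested, which is the sample-size dependence described above.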

The chi-squared test is sensitive to any lack of correspondence between the
predicted and observed proportions for all response patterns. For purposes
of this study, however, not all sources of such lack of fit are of equal
importance. Incomplete or inaccurate specification of either the latent
states or the classification matrices may result in statistically significant
lack of fit. Adequate specification of misclassifications was investigated
empirically in a preliminary study (Haertel, 1980), and guidelines were
developed to minimize lack of fit due to misspecification of misclassifications.
Incomplete modeling of latent states, however, may be inevitable.

It is to be expected that in our current state of knowledge, as it relates to
the modeling of latent states, some skills will be omitted. The substantive
model, which was implemented in our earlier work (Haertel, 1980), included
nine skills, and was clearly simpler than the actual processes of reading
comprehension. The model used in this study includes only one skill per
test level, an additional oversimplification of the total set of processes,
but one which apparently reflects what the tests are capable of measuring.

While the effects of an omitted skill required by only one of a set of
items may be absorbed into the misclassification specification for that item,
any omitted skills common to two or more items and possessed by some but
not all students will contribute to the lack of fit. Some such effects can
be avoided, e.g., by not using more than one item from a passage. Others,
however, will remain. The experience of researchers with models of this kind
has been that with samples as large as those to be used in this study,
non-significant chi-squares are rarely obtained (e.g., see Murray, 1971;
Proctor, 1970). In this study, lack of fit may arise not only as a consequence
of omitted reading comprehension skills, but also due to failure of the skills
included to function as underlying dichotomies. In addition, substantively
trivial departures from the predicted response pattern proportions may arise
due to response biases on the part of some children (e.g., a tendency to guess
the fourth choice), sex or racial/ethnic differences in the interest level
of individual passages, or any other systematic influence upon the responses
of a segment of the student population, operating across items. As described
below in the section on design effects, data from stratified cluster samples,
like those used in this study, can only approximate the characteristics of a
simple random sample. While an adjustment for this effect is made, it is
necessarily imperfect, and departures from the theoretical assumption of
simple random sampling may also perturb the fit statistics in this study.

The sensitivity of the overall chi-square test to omitted skills and the
difficulty of obtaining non-significant chi-squares with large samples
require that additional criteria be established for judging the fit of
the models. One such criterion is that discrepancies between the fitted
and observed response pattern proportions, i.e., residuals, be small.
Criteria for the acceptable magnitude of residuals, established on the basis of
two early sets of analyses, were used in earlier analyses (Haertel, 1980).
In addition to simple differences between observed and predicted proportions
(raw residuals), a standardized residual proposed by Cochran (1954) is
employed in establishing these criteria. This standardized residual is
asymptotically distributed as a normal deviate with zero mean and unit variance.
While it will increase with sample size in much the same way as the
likelihood ratio chi-square, it can provide information on whether lack of fit
is due to large residuals for a few cells (response patterns) or to moderate
residuals in many cells. In the former case, patterns of residuals can
provide valuable information on the sources of lack of fit, and can aid in
revising the model to bring the overall chi-square down.

B.3. Testing Individual Parameters

In this study, the major hypotheses addressed the existence of specific skills
and the hierarchical relationships specified among them. These hypotheses
can be formulated as specifying that certain parameters are or are not equal
to zero. A rigorous procedure is available for testing hypotheses of this
form.

If two skills are hierarchically related, no student should possess the second
who does not possess the first. Thus, the proportions of students in any
latent states including the second skill but not the first should be zero.
The hypothesis that two skills are hierarchically related is equivalent,
therefore, to a hypothesis that parameters representing proportions of
students in latent states corresponding to certain combinations of skill
states are equal to zero. If one of the skills used in defining the latent
states does not describe a difference among items and among students, then
pairs of latent states differing only with respect to that skill may be
collapsed. This is (mathematically) equivalent to setting the proportions
of students in all latent states for which that skill is present (or absent)
to zero. Thus, the hypothesis that a given skill exists can be considered
equivalent to the hypothesis that the value is not zero for parameters
representing proportions of students in at least one latent state including (or
not including) that skill.

To test whether one or more parameters are zero, two models are fitted. In
the first model, the parameters to be tested are permitted to take on any
value. The second model is exactly like the first, except that the parameters
to be tested are forced equal to zero. Since the second model is simply
the first with certain constraints, it must necessarily yield a chi-square
greater than or equal to that obtained with the first model. It will also
have more degrees of freedom: one more degree of freedom for each parameter
constrained to equal zero. The arithmetic difference of the likelihood ratio
chi-squares for these two models is known as a difference chi-square. It is
asymptotically distributed as a chi-square on as many degrees of freedom as
there were additional constraints imposed in the second model. Even if
the overall chi-squares for the two models are both significant, the
difference chi-square need not be. It tests the specific hypothesis that the
specified parameters are all equal to zero, under the assumption that the
other aspects of the model are correct.
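The comparison of nested models can be sketched as follows. The chi-square values and degrees of freedom in the example are hypothetical and are not results from the study. The tail probability is computed with the classical closed-form series for integer degrees of freedom, so that only the standard library is needed; the function names are illustrative.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2-1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2.0) / r
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df: two-sided normal tail plus a finite correction series.
    root = math.sqrt(x)
    tail = math.erfc(root / math.sqrt(2.0))
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    density = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    return tail + 2.0 * density * total


def difference_chi_square(chi2_free, df_free, chi2_constrained, df_constrained):
    """Difference chi-square, its degrees of freedom, and its tail probability."""
    diff = chi2_constrained - chi2_free
    df = df_constrained - df_free
    return diff, df, chi2_sf(diff, df)


# Hypothetical fits: one latent-state proportion is forced to zero in the
# constrained model, so it gains one degree of freedom.
diff, df, p_value = difference_chi_square(41.3, 9, 44.1, 10)
print(round(diff, 2), df, round(p_value, 3))
```

In this made-up case both overall chi-squares would be significant on their own degrees of freedom, yet the difference of 2.8 on one degree of freedom falls short of the conventional .05 critical value of 3.84, so the constraint would not be rejected.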

B.4. The Problem of Design Effects

The theory on which maximum likelihood estimation and the associated
statistics are based assumes a simple random sample from the population of
interest. Data to be used in this study, however, represent stratified cluster
samples. In obtaining each of these data sets, a universe of schools was first
defined, and all schools in the universe were divided into strata according
to size, location and other demographic characteristics. Once strata were
defined, some schools were randomly sampled within each stratum, and the
students tested were all clustered within these selected schools. In
comparison to a simple random sample of students, stratification could yield
increased precision. The effect of clustering, however, is to reduce
precision. This is because observations on students in the same school are
correlated. Thus, additional observations taken in the same school contain
less new information than observations on students selected at random from
the population. In the data used in this study, the net effect of
stratification of schools and clustering of students within schools was to
decrease precision. As a result, proportions of students manifesting different
response patterns are not expected to approximate population proportions as
closely as they would in a simple random sample of the same size. While
this has no systematic effect on the parameter estimates, it results in an
inflation of the likelihood ratio chi-square, and a reduction in the
estimated standard errors of the parameters.




A simple method is used to adjust for this effect. In practice the
variability of estimates based on a stratified cluster sample of a given size is
almost proportional to that of estimates based on a simple random sample of
the same size, and very close to that of estimates from a simple random
sample of somewhat smaller size. The size of a simple random sample yielding
the same precision as the actual stratified cluster sample is called the
effective sample size. By substituting the effective sample size for the
actual sample size in these analyses, the correct values of chi-squares and
standard errors can be approximated. Estimation of the effective sample
size for the data to be used in this study is described in the main text (4.).
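One way to picture the substitution is as a rescaling of statistics already computed with the actual sample size. The sketch below rests on two assumptions that the text does not state explicitly: that the likelihood ratio chi-square is proportional to the sample size for fixed observed and fitted proportions, and that standard errors are proportional to the reciprocal square root of the sample size. The function name and all numbers are hypothetical.

```python
import math


def adjust_for_design(chi_square, std_errors, n_actual, n_effective):
    """Rescale a chi-square and standard errors to the effective sample size.

    Assumes chi-square scales with n and standard errors with 1/sqrt(n).
    """
    ratio = n_effective / n_actual
    adjusted_chi_square = chi_square * ratio
    adjusted_std_errors = [se / math.sqrt(ratio) for se in std_errors]
    return adjusted_chi_square, adjusted_std_errors


# Hypothetical: 2000 students tested, effective sample size 1250.
chi2_adj, se_adj = adjust_for_design(80.0, [0.020, 0.035], 2000, 1250)
print(chi2_adj)
print([round(se, 4) for se in se_adj])
```

With an effective sample size smaller than the actual one, the chi-square shrinks and the standard errors grow, which undoes the inflation and the understatement described in the preceding paragraphs.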
Intermediate Estimates for Scaling of
Cumulative Skill Level Proportions
Display C.1. Logits of Estimated Latent State Probabilities

                     Logits of Cumulative Latent State Probabilities
Item Set    Grade        <B          <C          <1          <2

   X          1         1.681
              2        -0.995      -0.032
              3                    -0.695      -0.237
              4                                -0.22?       0.016
              5                                -0.900      -0.731
              6                                            -0.871

   Y          1         1.319
              2        -0.781      -0.076
              3                    -0.974      -0.044
              4                                -0.008       0.221
              5                                -0.641      -0.443
              6                                            -1.030
Display C.2. Estimated Variances and Covariances of Logits of Latent State
Probability Estimates*

                          Estimated Dispersions (10-^ )
Item Set    Grade    first variance    covariance    second variance

   X          1         1202.55
              2           13.09           4.66           14.06
              3            4.91           4.90            7.72
              4           12.D7           8.68           14.^6
              5           14.86           9.57           15.84
              6           60.74

   Y          1          119.66
              2            4.49           1.81            3.57
              3            5.20           3.26           12.21
              4            5.28           2.98            4.47
              5            5.95           3.24            4.64
              6           19.30

*The estimated covariance of the logits (f) of two probability estimates is

\[
\operatorname{cov}\bigl(f(p_1), f(p_2)\bigr)
  = \left[\frac{1}{p_1(1-p_1)}\right]
    \left[\frac{1}{p_2(1-p_2)}\right]
    \operatorname{cov}(p_1, p_2)
\]
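The covariance formula for logits is a first-order (delta method) approximation, since the derivative of the logit f(p) = ln(p/(1-p)) is 1/(p(1-p)). A minimal sketch follows; the function names and the numerical inputs are hypothetical and are not estimates from the study.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def logit_covariance(p1, p2, cov_p1_p2):
    """Approximate cov(f(p1), f(p2)) from cov(p1, p2), where f is the logit.

    Each factor 1 / (p * (1 - p)) is the derivative of the logit at p.
    """
    return cov_p1_p2 / (p1 * (1.0 - p1) * p2 * (1.0 - p2))


# Hypothetical estimates: two cumulative proportions and their covariance.
print(round(logit(0.30), 3), round(logit(0.60), 3))
print(round(logit_covariance(0.30, 0.60, 0.0002), 6))

# Setting p1 = p2 and passing a variance gives the variance of a single logit.
print(round(logit_covariance(0.50, 0.50, 0.001), 6))
```

Because p(1-p) is at most .25, the variance of a logit is always at least sixteen times the variance of the underlying proportion, and the multiplier grows quickly as the proportion approaches zero or one.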



Display C.3. Logistic Scale Values for Grades and States

                                   Item Set
Scale Value      Parameter       X           Y

Baseline             A          1.681       1.319

Grade
  1                             0.000       0.000
  2                            -2.676      -2.100
  3                            -3.403      -2.998
  4                            -3.391      -2.962
  5                            -4.102      -3.611
  6                            -4.242      -4.198

State
  <B                            0.000       0.000
  <C                            1.027       0.705
  <1                                        1.635
  <2                                        1.849




